Enrich StickyPostgresqlQueueListenerAgent.CheckHealthAsync with listener-state signals (#2647) #2650

Merged
jeremydmiller merged 1 commit into main from 2647-enrich-sticky-pg-listener-health
May 1, 2026

Conversation

@jeremydmiller
Member

Closes #2647.

Summary

StickyPostgresqlQueueListenerAgent relied on the default IAgent.CheckHealthAsync (Status==Running ? Healthy : Unhealthy), which hides the cases that matter most operationally for sticky per-tenant listeners: the tenant's database becoming unreachable, the listener latching due to errors, or a backlog growing on a particular tenant queue.

CheckHealthAsync now layers three signals on top of Status:

  1. Per-tenant database reachability: `SELECT 1` against the assigned NpgsqlDataSource via TenantedPostgresqlQueue.PingDatabaseAsync. One failure ⇒ Degraded with the underlying error message; 3 consecutive failures ⇒ Unhealthy. This localizes the symptom to the specific node + tenant pair that's misbehaving. Sticky listeners are by definition pinned, so generic "listener is running" health doesn't catch this.

  2. Listener latch state: mirrors what ExclusiveListenerAgent.CheckHealthAsync does. It asks the runtime for the underlying IListeningAgent and translates ListeningStatus.TooBusy ⇒ Degraded, ListeningStatus.GloballyLatched ⇒ Unhealthy.

  3. Per-tenant queue depth: TenantedPostgresqlQueue.GetQueueDepthAsync runs `SELECT COUNT(*)` against the parent queue table on the assigned tenant DB. When the parent endpoint sets a BufferingLimits ceiling, depth ≥ Maximum ⇒ Degraded. Skipped silently when no buffering limit is configured.
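
The three signal-to-status mappings above can be sketched as pure functions. This is a minimal illustration, not the code in this PR: `Health`, `ListenerState`, and `StickyListenerSignals` are hypothetical stand-ins (for the health-check result status and Wolverine's ListeningStatus) so the snippet compiles without any Wolverine references, and the nullable `maximum` parameter is an assumed way to model "no buffering limit configured".

```csharp
// Hypothetical stand-ins so the sketch compiles on its own.
public enum Health { Healthy, Degraded, Unhealthy }
public enum ListenerState { Accepting, TooBusy, GloballyLatched }

public static class StickyListenerSignals
{
    // The PR describes 3 consecutive failures as the Unhealthy threshold.
    public const int ConsecutiveDbFailureUnhealthyThreshold = 3;

    // Signal 1: any failed ping degrades; a run of failures at or past
    // the threshold is Unhealthy.
    public static Health FromDbFailures(int consecutiveFailures) =>
        consecutiveFailures >= ConsecutiveDbFailureUnhealthyThreshold ? Health.Unhealthy
        : consecutiveFailures > 0 ? Health.Degraded
        : Health.Healthy;

    // Signal 2: TooBusy => Degraded, GloballyLatched => Unhealthy.
    public static Health FromListenerState(ListenerState state) => state switch
    {
        ListenerState.GloballyLatched => Health.Unhealthy,
        ListenerState.TooBusy => Health.Degraded,
        _ => Health.Healthy
    };

    // Signal 3: depth at or above the ceiling => Degraded.
    // A null maximum models "no buffering limit configured" and is skipped.
    public static Health FromQueueDepth(long depth, int? maximum) =>
        maximum is int max && depth >= max ? Health.Degraded : Health.Healthy;
}
```

Keeping each mapping side-effect free like this would make the thresholds easy to unit test without a database or a running listener.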

Multiple Degraded reasons aggregate into a ;-joined description so monitoring tools (CritterWatch's Agents tab) see the full picture in one tooltip; an Unhealthy reason takes precedence for the overall status, and any Degraded reasons are carried along in the same description.
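
The aggregation rule can be sketched as follows. This is an illustration under assumptions rather than the agent's actual code: `Health` and `HealthAggregation.Combine` are hypothetical names, the `"; "` separator (with a trailing space) and the ordering of Unhealthy reasons ahead of Degraded ones are guesses, and the empty description for the all-healthy case is a placeholder.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in, declared in ascending order of severity.
public enum Health { Healthy, Degraded, Unhealthy }

public static class HealthAggregation
{
    public static (Health Status, string Description) Combine(
        IEnumerable<(Health Status, string Reason)> findings)
    {
        // Only non-healthy findings contribute a reason.
        var problems = findings.Where(f => f.Status != Health.Healthy).ToList();
        if (problems.Count == 0) return (Health.Healthy, "");

        // The worst status wins: Unhealthy takes precedence over Degraded.
        var worst = problems.Max(f => f.Status);

        // Every reason is kept, most severe first (assumed ordering),
        // so one tooltip shows the full picture.
        var description = string.Join("; ",
            problems.OrderByDescending(f => f.Status).Select(f => f.Reason));

        return (worst, description);
    }
}
```

For example, combining a Degraded queue-depth finding with an Unhealthy database finding yields an Unhealthy result whose description lists both reasons.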

API surface

Two internal helpers added on TenantedPostgresqlQueue so the agent stays free of raw NpgsqlCommand plumbing:

  • PingDatabaseAsync(CancellationToken): `SELECT 1`
  • GetQueueDepthAsync(CancellationToken): `SELECT COUNT(*)` on the parent queue table

The agent also exposes a test-only internal int ConsecutiveDbFailureCount accessor for diagnostics.
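
For readers unfamiliar with the shape of these helpers, a rough sketch against Npgsql 7.0+ might look like the following. The real methods are internal instance members of TenantedPostgresqlQueue and their bodies are not shown here, so `TenantQueueProbes`, the static signatures, and the `qualifiedTableName` parameter are all assumptions made for illustration.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Npgsql;

// Hypothetical stand-in for the internal helpers on TenantedPostgresqlQueue.
public static class TenantQueueProbes
{
    // Throws if the tenant database cannot be reached; the caller
    // decides how a failure maps to Degraded or Unhealthy.
    public static async Task PingDatabaseAsync(
        NpgsqlDataSource dataSource, CancellationToken token)
    {
        await using var cmd = dataSource.CreateCommand("SELECT 1");
        await cmd.ExecuteScalarAsync(token);
    }

    // Counts rows in the queue table. Table names cannot be sent as SQL
    // parameters, so qualifiedTableName must come from trusted
    // configuration, never from user input.
    public static async Task<long> GetQueueDepthAsync(
        NpgsqlDataSource dataSource, string qualifiedTableName, CancellationToken token)
    {
        await using var cmd = dataSource.CreateCommand(
            $"SELECT COUNT(*) FROM {qualifiedTableName}");
        var result = await cmd.ExecuteScalarAsync(token);
        return Convert.ToInt64(result); // COUNT(*) comes back as bigint
    }
}
```

Commands created from an NpgsqlDataSource open and return a pooled connection on their own, which is one reason the agent can stay free of connection plumbing.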

Test plan

  • PostgresqlTests/Transport/sticky_listener_health_tests covers:
    • Status precedence (Stopped ⇒ Unhealthy)
    • No-endpoint short-circuit (Healthy before StartAsync wires up _tenantEndpoint)
    • Description content (mentions queue + tenant)
    • PingDatabaseAsync against real Postgres (Should.NotThrow)
    • PingDatabaseAsync against a deliberately-broken connection string (Should.Throw)
    • GetQueueDepthAsync returning 0 for an empty table
    • GetQueueDepthAsync reflecting actual inserted-row counts
  • 7/7 sticky-listener tests pass locally against the wolverine-postgresql-1 docker container.
  • Brought NSubstitute into PostgresqlTests so the no-runtime tests can stub IWolverineRuntime cheaply.

🤖 Generated with Claude Code

StickyPostgresqlQueueListenerAgent relied on the default IAgent.CheckHealthAsync —
Status==Running ? Healthy : Unhealthy — which hides the cases that matter most
operationally for sticky per-tenant listeners: the tenant's database becoming
unreachable, the listener latching due to errors, or a backlog growing on a
particular tenant queue.

CheckHealthAsync now layers three persistence signals on top of the agent's
Status:

1. Per-tenant database reachability — `SELECT 1` against the assigned
   NpgsqlDataSource via TenantedPostgresqlQueue.PingDatabaseAsync. One failure
   ⇒ Degraded with the underlying error message; ConsecutiveDbFailureUnhealthyThreshold
   (3) consecutive failures ⇒ Unhealthy. Localizes the symptom to the specific
   node + tenant pair that's misbehaving — sticky listeners are by definition
   pinned, so generic "listener is running" health doesn't catch this.

2. Listener latch state — mirrors what ExclusiveListenerAgent.CheckHealthAsync
   does: ask the runtime for the underlying IListeningAgent and translate
   ListeningStatus.TooBusy ⇒ Degraded, ListeningStatus.GloballyLatched ⇒
   Unhealthy. Same description style (`Listener {queue}/{db}…`) so an operator
   sees a consistent shape.

3. Per-tenant queue depth — TenantedPostgresqlQueue.GetQueueDepthAsync runs
   `SELECT COUNT(*)` against the parent queue table on the assigned tenant DB.
   When the parent endpoint sets a BufferingLimits ceiling, depth ≥ Maximum ⇒
   Degraded. Skipped silently when no buffering limit is configured.

Multiple Degraded reasons aggregate into a `;`-joined description so monitoring
tools see the full picture in one tooltip; an Unhealthy reason takes precedence
and pulls Degraded reasons along.

Two helpers landed on TenantedPostgresqlQueue to keep the agent free of raw
NpgsqlCommand plumbing: `PingDatabaseAsync` and `GetQueueDepthAsync`. Both are
internal so the agent assembly can use them and tests in PostgresqlTests can
exercise them directly.

Test plan:

* PostgresqlTests/Transport/sticky_listener_health_tests covers:
  - Status precedence (Stopped ⇒ Unhealthy)
  - No-endpoint short-circuit (Healthy before StartAsync wires up _tenantEndpoint)
  - Description content (mentions queue + tenant)
  - PingDatabaseAsync against real Postgres (Should.NotThrow)
  - PingDatabaseAsync against a deliberately-broken connection string (Should.Throw)
  - GetQueueDepthAsync returning 0 for an empty table
  - GetQueueDepthAsync reflecting actual inserted-row counts

* All 7 sticky-listener tests pass locally against the wolverine-postgresql-1
  Docker container. Brought NSubstitute into PostgresqlTests so the no-runtime
  tests can stub IWolverineRuntime cheaply.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jeremydmiller merged commit 0426185 into main on May 1, 2026
19 of 21 checks passed